Introduction

A common problem with Bayesian fine-mapping approaches is the large number of variants contained in the credible set. Moreover, we have identified instances of over- and under-coverage, whereby fewer or more variants, respectively, are needed in the credible set to accurately achieve the nominal \(\alpha\) coverage (often \(\alpha=95\%\)). We hope to amend this problem by developing a correction factor which, when applied to the data, will rectify instances of over- and under-coverage.
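For concreteness, here is a minimal sketch (in Python, with hypothetical numbers) of how an \(\alpha\) credible set is formed, and hence where over- and under-coverage arise: variants are ranked by posterior probability and added until the cumulative probability first reaches \(\alpha\); empirical coverage is then the proportion of simulations in which the causal variant falls inside the set.

```python
import numpy as np

def credible_set(pp, alpha=0.95):
    """Rank variants by posterior probability (PP) and add them
    until the cumulative PP first reaches alpha."""
    pp = np.asarray(pp, dtype=float)
    order = np.argsort(pp)[::-1]                 # decreasing PP
    csum = np.cumsum(pp[order])
    size = int(np.searchsorted(csum, alpha)) + 1
    return order[:size]

# Hypothetical 4-variant system: the 0.9 credible set needs 3 variants
cs = credible_set([0.5, 0.3, 0.15, 0.05], alpha=0.9)
```

Averaging "causal variant in set" over many such simulations gives the empirical coverage that the correction factor aims to bring in line with the claimed \(\alpha\).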

Schaid et al. (2018) state that the expected posterior probability depends on the effect size of the causal SNP on a trait (OR), the sample size (N), the number of SNPs (nsnps) and the SNP correlation structure (LD). Since experimenters know the sample size and the number of SNPs analysed, and have some knowledge of the correlation structure, I wish to find a way for experimenters to quantify the information the OR provides to the system, without actually having to know the OR. This could then be used in the correction factor to improve coverage estimates in Bayesian fine-mapping experiments.

Method

Firstly, low, medium and high power systems were constructed to understand the shape of the posterior probability systems and consider ways the OR affects this shape.

Secondly, possible measures of ‘entropy’ are considered: measures that I hope will quantify the shape of the posterior probability system and the information that the OR provides. These are calculated and plotted for the model systems.

Thirdly, a simulation dataset is created with N, nsnps, thr, MAF and LD fixed, so that variation in power is due only to variation in the OR. The entropy measures are calculated for each simulated system, and logistic regression and random forest methods are used to analyse how well the entropy measures predict coverage.

Finally, I simulate posterior probability systems with everything fixed except OR. I look at the distribution of the entropy measures for these simulations.


1. Construct model low, medium and high power systems

Model posterior probability systems were produced for low, medium and high power systems.


Low power system (maf=0.05, OR=1, N0=N1=50)

# x.low <- ref(maf=0.05)
# test.low <- simdata_x(x.low, OR=1, N0=50, N1=50) # get 100 systems
# t.low <- test.low[[1]][order(-test.low[[1]]$PP),] 

setwd("/Users/anna/PhD")
t.low <- read.table("t.low")
head(t.low)
##        snp      pvalues   MAF         PP    CV
## s65 SNP.65 0.0005938498 0.055 0.01647548 FALSE
## s58 SNP.58 0.0013229541 0.043 0.01413227 FALSE
## s1   SNP.1 0.0231507202 0.060 0.01223052 FALSE
## s7   SNP.7 0.0362587474 0.062 0.01181555 FALSE
## s91 SNP.91 0.0389514797 0.057 0.01160966 FALSE
## s33 SNP.33 0.0425323171 0.053 0.01142506 FALSE

The plots below show the pdf and the cdf of the posterior probabilities for this low power system.


Medium power system (maf=0.3, OR=1.1, N0=N1=700)

# x.med <- ref(maf=0.3)
# test.med <- simdata_x(x.med, OR=1.1, N0=700, N1=700) # get 100 systems
# t.med <- test.med[[1]][order(-test.med[[1]]$PP),] 

setwd("/Users/anna/PhD")
t.med <- read.table("t.med")
head(t.med)
##        snp    pvalues   MAF         PP    CV
## s27 SNP.27 0.01129078 0.294 0.08274654 FALSE
## s76 SNP.76 0.01650056 0.309 0.06204450 FALSE
## s64 SNP.64 0.01858309 0.279 0.05701422 FALSE
## s68 SNP.68 0.04594892 0.300 0.02921609 FALSE
## s42 SNP.42 0.06025226 0.302 0.02405483 FALSE
## s14 SNP.14 0.07338729 0.298 0.02097193 FALSE

The plots below show the pdf and the cdf of the posterior probabilities for this medium power system.


High power system (maf=0.5, OR=1.3, N0=N1=1000)

# x.high <- ref(maf=0.5)
# test.high <- simdata_x(x.high, OR=1.3, N0=1000, N1=1000) # get 100 systems
# t.high <- test.high[[1]][order(-test.high[[1]]$PP),] 

setwd("/Users/anna/PhD")
t.high <- read.table("t.high")
head(t.high)
##        snp      pvalues   MAF           PP    CV
## s56 SNP.56 2.394407e-06 0.514 0.9841670039  TRUE
## s70 SNP.70 1.163447e-03 0.489 0.0048218359 FALSE
## s55 SNP.55 1.948468e-03 0.497 0.0031308545 FALSE
## s87 SNP.87 2.212163e-02 0.495 0.0004312507 FALSE
## s57 SNP.57 3.626935e-02 0.501 0.0002927989 FALSE
## s72 SNP.72 3.815923e-02 0.498 0.0002815128 FALSE

The plots below show the pdf and the cdf of the posterior probabilities for this high power system.


The following figure shows all three model systems on the same plot. Note that the scale on the y-axis means we cannot see the first snp in the high power system (which has posterior probability 0.984).

A green dashed line has been added to illustrate the case where all posterior probabilities are equal (base level low power system). We could consider using this line to quantify the differences in shapes of posterior probability systems, with low power systems being close to this line and high power systems being far from this line.


Weighted exponential curves

Weighted exponential curves were fit to the pdfs of each system. The weight for the first snp (the one with the highest posterior probability) was set to 1000, to ensure the curve incorporated this point into the fitting.

set.seed(2)
we <- rep(1,100)
we[1] <- 1000 # set weight of first snp to 1000
snps <- seq(1,100,1)

# fit exp model: low
exp.low <- lm(log(t.low$PP)~snps, weights=we)

# fit exp model: med
exp.med <- lm(log(t.med$PP)~snps, weights=we)

# fit exp model: high
exp.high <- lm(log(t.high$PP)~snps, weights = we)

The plot below shows the points and the fitted exponential curves.

The equations are given below:

Low:

\(pp = 0.01644\times exp(-0.00796x)\)

Medium:

\(pp = 0.08268\times exp(-0.03847x)\)

High:

\(pp = 0.93228\times exp(-0.14631x)\)

Perhaps these equations could be used to quantify the disorder of the system, and the information the OR provides to it.
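The equations come straight from the weighted log-linear fits: the reported intercept is \(exp(B_0)\) and the slope is \(B_1\). A standalone Python sketch of the same weighted fit (numpy only; the weighting scheme mirrors the lm() calls above):

```python
import numpy as np

def fit_weighted_exponential(pp, first_weight=1000.0):
    """Fit pp ~ a*exp(b*x) by weighted least squares on log(pp),
    up-weighting the top SNP as in the R code above."""
    pp = np.sort(np.asarray(pp, dtype=float))[::-1]   # order by decreasing PP
    n = len(pp)
    x = np.arange(1, n + 1, dtype=float)
    w = np.ones(n)
    w[0] = first_weight                  # set weight of first snp
    X = np.column_stack([np.ones(n), x])
    XtW = X.T * w                        # solve (X'WX) beta = X'W log(pp)
    b0, b1 = np.linalg.solve(XtW @ X, XtW @ np.log(pp))
    return np.exp(b0), b1                # (a, b) in pp = a*exp(b*x)
```

Applied to an exactly exponential set of posterior probabilities, the fit recovers the generating constants, so the three fitted equations above are just \((exp(B_0), B_1)\) pairs from each system.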


2. Measures of entropy

Different measures were considered to try to capture the disorder of the system.

  1. ‘Entropy’: \(-mean(log(pp))\)

  2. ‘Max’: \(max(pp)\)

  3. ‘Median’: \(median(pp)\)

  4. ‘AUCDF’: Area under the cdf

  5. ‘KL divergence’: \(mean(log(pp/q))\) where q=rep(1/nsnps,nsnps).

  6. ‘Slope’: The slope of the weighted exponential curve fitted to the pdf of the posterior probabilities.

  7. ‘Intercept’: The intercept of the weighted exponential curve fitted to the pdf of the posterior probabilities (\(exp(B_0)\)).

  8. ‘Hellinger’: The Hellinger distance between the pp distribution (P) and q (Q), \(\frac{1}{\sqrt{2}}||\sqrt{P}-\sqrt{Q}||_2\).

  9. ‘Kolmogorov’: The sup-distance between the cdfs of the pp distribution and q.
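The measures as defined above can be computed directly; a Python sketch (numpy only), assuming the uniform reference q = rep(1/nsnps, nsnps) and taking ‘AUCDF’ as the mean of the empirical cdf (one plausible discretisation; the slope and intercept come from the weighted exponential fit and are omitted here):

```python
import numpy as np

def entropy_measures(pp):
    """Disorder measures for a vector of posterior probabilities,
    relative to the uniform reference q = 1/nsnps."""
    pp = np.sort(np.asarray(pp, dtype=float))[::-1]
    n = len(pp)
    q = np.full(n, 1.0 / n)
    cdf_p, cdf_q = np.cumsum(pp), np.cumsum(q)
    return {
        "entropy":    -np.mean(np.log(pp)),
        "max":        pp[0],
        "median":     float(np.median(pp)),
        "AUCDF":      float(np.mean(cdf_p)),   # assumed discretisation of the area
        "KL":         float(np.mean(np.log(pp / q))),
        "hellinger":  float(np.linalg.norm(np.sqrt(pp) - np.sqrt(q)) / np.sqrt(2)),
        "kolmogorov": float(np.max(np.abs(cdf_p - cdf_q))),
    }
```

For a perfectly uniform system every distance-type measure is zero and the entropy equals \(log(nsnps)\), matching the ‘base level’ green dashed line above.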


The table below shows the values for these measures for the model systems above.


Measure          Low      Medium   High
-mean(log(pp))   4.6076   4.8280   9.4613
max(pp)          0.0137   0.0827   0.9842
median(pp)       0.0097   0.0068   5.4e-05
Area under cdf   0.0038   0.0728   0.9742
KL divergence    -0.302   -0.223   -4.856
Slope            -0.007   -0.034   -0.124
Intercept        0.0164   0.0826   0.9322
Hellinger        0.0314   0.2543   0.8978
Kolmogorov       0.0283   0.2767   0.9742

N.b., the intercept reported is \(exp(B_0)\) from the fitted weighted exponential models.


The plots below show these results (low, medium, then high).

The measures seem good at distinguishing between low, medium and high power systems.


3. Simulation for varying OR

A new simulation dataset (20000 entries) was made with fixed N = 1000, nsnps = 100, maf = 0.5, LD1 = 0.3, LD2 = 0.2 and thr = 0.8.


1. Baseline model for unordered

  • AIC: 9019.5
  • r squared adjusted: 0.01853919
library(rsq)
set.seed(2)
baseline.noord <- glm(covered~logitsize, data=new.noord, family="binomial")
# rsq(baseline.noord,adj=TRUE)
summary(baseline.noord)
## 
## Call:
## glm(formula = covered ~ logitsize, family = "binomial", data = new.noord)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9946   0.3636   0.6575   0.6764   0.6819  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.18061    0.11620   1.554     0.12    
## logitsize    0.83659    0.07297  11.465   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9271.4  on 9999  degrees of freedom
## Residual deviance: 9015.5  on 9998  degrees of freedom
## AIC: 9019.5
## 
## Number of Fisher Scoring iterations: 6

2. Baseline model for ordered

  • AIC: 10248
  • r squared adjusted: 0.002630046
set.seed(2)
baseline.ord <- glm(covered~logitsize, data=new.ord, family="binomial")
# rsq(baseline.ord,adj=TRUE)
summary(baseline.ord)
## 
## Call:
## glm(formula = covered ~ logitsize, family = "binomial", data = new.ord)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6592   0.6588   0.6941   0.6955   0.6969  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.71674    0.12474   5.746 9.16e-09 ***
## logitsize    0.41448    0.08464   4.897 9.74e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10282  on 9999  degrees of freedom
## Residual deviance: 10244  on 9998  degrees of freedom
## AIC: 10248
## 
## Number of Fisher Scoring iterations: 5

The logitsize-only model fits the non-ordered sets better than the ordered sets (higher adjusted r^2 and lower AIC), although there is a lot of noise.


3. Ordered: logitsize plus OR

  • AIC: 8788.8
  • r squared adjusted: 0.1369296
set.seed(2)
ord.OR <- glm(covered~logitsize+OR, data=new.ord, family="binomial")
# rsq(ord.OR,adj=TRUE)
summary(ord.OR)
## 
## Call:
## glm(formula = covered ~ logitsize + OR, family = "binomial", 
##     data = new.ord)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5720   0.2732   0.4084   0.8645   1.1941  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -16.40243    0.51735 -31.705   <2e-16 ***
## logitsize    -0.11279    0.07082  -1.593    0.111    
## OR           16.52438    0.49434  33.427   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10281.8  on 9999  degrees of freedom
## Residual deviance:  8782.8  on 9997  degrees of freedom
## AIC: 8788.8
## 
## Number of Fisher Scoring iterations: 5

By incorporating our new measures of entropy, we hope to decrease AIC and increase r squared.
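These comparisons lean on the AIC reported by glm; as a reminder of what is being compared, a minimal numpy sketch of logistic regression by IRLS with \(AIC = 2k - 2\ell\) (illustrative only; glm handles this internally):

```python
import numpy as np

def logistic_aic(X, y, iters=25):
    """Fit logistic regression by iteratively reweighted least squares
    (Newton's method) and return (coefficients, AIC)."""
    X = np.column_stack([np.ones(len(y)), X])   # add intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)
        # Newton step: beta += (X'WX)^{-1} X'(y - p)
        beta += np.linalg.solve((X.T * W) @ X, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return beta, 2 * X.shape[1] - 2 * loglik
```

Adding a predictor always adds 2 to the penalty term, so a new entropy measure only helps if it buys more than one unit of log-likelihood — which, as the fits below show, none of them do.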


4. Ordered: logitsize plus entropy measures


4a. Logitsize+entropy
  • AIC: 10202
  • r squared adjusted: 0.007772183
set.seed(2)
ord.ent <- glm(covered~logitsize+entropy, data=new.ord, family="binomial")
# rsq(ord.ent,adj=TRUE)
summary(ord.ent)
## 
## Call:
## glm(formula = covered ~ logitsize + entropy, family = "binomial", 
##     data = new.ord)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0757   0.6405   0.6650   0.6937   1.3071  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.83918    0.20898   8.801  < 2e-16 ***
## logitsize    1.14021    0.14639   7.789 6.77e-15 ***
## entropy     -0.41733    0.05906  -7.067 1.59e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10282  on 9999  degrees of freedom
## Residual deviance: 10196  on 9997  degrees of freedom
## AIC: 10202
## 
## Number of Fisher Scoring iterations: 5

4b. Logitsize+max_pp
  • AIC: 10205
  • r^2 adjusted: 0.007384386
set.seed(2)
ord.max <- glm(covered~logitsize+max_pp, data=new.ord, family="binomial")
# rsq(ord.max,adj=TRUE)
summary(ord.max)
## 
## Call:
## glm(formula = covered ~ logitsize + max_pp, family = "binomial", 
##     data = new.ord)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0494   0.6446   0.6634   0.6928   0.8714  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.4328     0.1511   2.864  0.00418 ** 
## logitsize     0.7580     0.1124   6.744 1.55e-11 ***
## max_pp       -0.8948     0.1307  -6.844 7.72e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10282  on 9999  degrees of freedom
## Residual deviance: 10199  on 9997  degrees of freedom
## AIC: 10205
## 
## Number of Fisher Scoring iterations: 5

4c. Logitsize+KL
  • AIC: 10202
  • r^2 adjusted: 0.007772183
set.seed(2)
ord.KL <- glm(covered~logitsize+KL, data=new.ord, family="binomial")
# rsq(ord.KL,adj=TRUE)
summary(ord.KL)
## 
## Call:
## glm(formula = covered ~ logitsize + KL, family = "binomial", 
##     data = new.ord)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0757   0.6405   0.6650   0.6937   1.3071  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.08269    0.18765  -0.441    0.659    
## logitsize    1.14021    0.14639   7.789 6.77e-15 ***
## KL           0.41733    0.05906   7.067 1.59e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10282  on 9999  degrees of freedom
## Residual deviance: 10196  on 9997  degrees of freedom
## AIC: 10202
## 
## Number of Fisher Scoring iterations: 5
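N.b., 4a and 4c are the same fit. Since q = rep(1/nsnps, nsnps) is constant, the ‘KL’ and ‘entropy’ measures as defined here are affinely related:

```latex
\mathrm{KL} \;=\; \mathrm{mean}\!\left(\log\frac{pp}{q}\right)
            \;=\; \mathrm{mean}(\log pp) + \log(\mathrm{nsnps})
            \;=\; \log(\mathrm{nsnps}) - \mathrm{entropy}
```

With nsnps = 100, \(\log(100)\approx 4.605\); accordingly, the two coefficients above are equal and opposite (\(\pm 0.41733\)), the intercepts differ by \(0.41733\times\log(100)\approx 1.92\), and the deviances and AICs are identical.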

4d. Logitsize+AUCDF
  • AIC: 10205
  • r^2 adjusted: 0.00738425
set.seed(2)
ord.AUCDF <- glm(covered~logitsize+AUCDF, data=new.ord, family="binomial")
# rsq(ord.AUCDF,adj=TRUE)
summary(ord.AUCDF)
## 
## Call:
## glm(formula = covered ~ logitsize + AUCDF, family = "binomial", 
##     data = new.ord)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0494   0.6446   0.6634   0.6928   0.8714  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.4238     0.1515   2.797  0.00516 ** 
## logitsize     0.7580     0.1124   6.744 1.55e-11 ***
## AUCDF        -0.8948     0.1307  -6.844 7.72e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10282  on 9999  degrees of freedom
## Residual deviance: 10199  on 9997  degrees of freedom
## AIC: 10205
## 
## Number of Fisher Scoring iterations: 5

4e. Logitsize+slope (of exponential curve fitted to posterior probabilities pdf)
  • AIC = 10223
  • r^2 adjusted = 0.005565675
set.seed(2)
ord.slope <- glm(covered~logitsize+slope, data=new.ord, family="binomial")
# rsq(ord.slope,adj=TRUE)
summary(ord.slope)
## 
## Call:
## glm(formula = covered ~ logitsize + slope, family = "binomial", 
##     data = new.ord)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6821   0.6452   0.6740   0.6917   1.0031  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.72760    0.12474   5.833 5.45e-09 ***
## logitsize    0.40968    0.08463   4.841 1.29e-06 ***
## slope       11.91004    2.23733   5.323 1.02e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10282  on 9999  degrees of freedom
## Residual deviance: 10217  on 9997  degrees of freedom
## AIC: 10223
## 
## Number of Fisher Scoring iterations: 5
4f. Logitsize+int (of exponential curve fitted to posterior probabilities)
  • AIC = 10223
  • r^2 adjusted = 0.005961888
set.seed(2)
ord.int <- glm(covered~logitsize+int, data=new.ord, family="binomial")
# rsq(ord.int,adj=TRUE)
summary(ord.int)
## 
## Call:
## glm(formula = covered ~ logitsize + int, family = "binomial", 
##     data = new.ord)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6598   0.6491   0.6879   0.6908   1.3955  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.76646    0.12584   6.091 1.12e-09 ***
## logitsize    0.40734    0.08512   4.786 1.70e-06 ***
## int         -3.78905    0.77181  -4.909 9.14e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10282  on 9999  degrees of freedom
## Residual deviance: 10217  on 9997  degrees of freedom
## AIC: 10223
## 
## Number of Fisher Scoring iterations: 5
4g. Logitsize+hellinger
  • AIC = 10207
  • r^2 adjusted = 0.007092911
set.seed(2)
ord.hell <- glm(covered~logitsize+hellinger, data=new.ord, family="binomial")
# rsq(ord.hell,adj=TRUE)
summary(ord.hell)
## 
## Call:
## glm(formula = covered ~ logitsize + hellinger, family = "binomial", 
##     data = new.ord)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9932   0.6248   0.6650   0.7028   0.8665  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.7347     0.1407   5.220 1.79e-07 ***
## logitsize     0.7248     0.1086   6.674 2.48e-11 ***
## hellinger    -1.2276     0.1860  -6.600 4.11e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10282  on 9999  degrees of freedom
## Residual deviance: 10201  on 9997  degrees of freedom
## AIC: 10207
## 
## Number of Fisher Scoring iterations: 5
4h. Logitsize+kolmogorov
  • AIC = 10203
  • r^2 adjusted = 0.007428732
set.seed(2)
ord.kol <- glm(covered~logitsize+kolmogorov, data=new.ord, family="binomial")
# rsq(ord.kol,adj=TRUE)
summary(ord.kol)
## 
## Call:
## glm(formula = covered ~ logitsize + kolmogorov, family = "binomial", 
##     data = new.ord)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9563   0.6398   0.6638   0.6951   0.9381  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.4842     0.1476   3.281  0.00103 ** 
## logitsize     0.7425     0.1101   6.745 1.53e-11 ***
## kolmogorov   -1.1598     0.1661  -6.980 2.94e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10282  on 9999  degrees of freedom
## Residual deviance: 10197  on 9997  degrees of freedom
## AIC: 10203
## 
## Number of Fisher Scoring iterations: 5
4i. Logitsize+median
  • AIC = 10213
  • r^2 adjusted = 0.006478231
set.seed(2)
ord.med <- glm(covered~logitsize+median, data=new.ord, family="binomial")
# rsq(ord.med,adj=TRUE)
summary(ord.med)
## 
## Call:
## glm(formula = covered ~ logitsize + median, family = "binomial", 
##     data = new.ord)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9192   0.6214   0.6676   0.7058   0.8279  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.05516    0.18808  -0.293    0.769    
## logitsize    0.63915    0.10161   6.290 3.17e-10 ***
## median      92.74375   15.09105   6.146 7.97e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10282  on 9999  degrees of freedom
## Residual deviance: 10207  on 9997  degrees of freedom
## AIC: 10213
## 
## Number of Fisher Scoring iterations: 5

Unfortunately, none of our measures of entropy seem to improve the model significantly.


5. Run cv.glmnet and see if they all get dropped

set.seed(2)
library(glmnet) 
x_ord        <- as.matrix(new.ord[,c(1:3,5:15)]) 
cv.ord <- cv.glmnet(x_ord, y=new.ord$covered, alpha=1, family="binomial")
coef(cv.ord,s=cv.ord$lambda.1se) # r2 is 0.1757218
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) -17.54380464
## thr           .         
## logitsize     .         
## nvar          0.02008728
## entropy       .         
## max_pp        .         
## KL            .         
## AUCDF         .         
## slope         2.64496729
## int          -0.69730444
## hellinger     .         
## kolmogorov    .         
## median        .         
## N             .         
## OR           16.49502716

5b. Run cv.glmnet and see if they get dropped WITHOUT OR

  • Same result as just averaging the covered column.
set.seed(2)
library(glmnet) # r2 is 0.001173106
x_ord1        <- as.matrix(new.ord[,c(1:3,5:14)]) # design matrix
cv.ord1 <- cv.glmnet(x_ord1, y=new.ord$covered, alpha=1, family="binomial")
coef(cv.ord1,s=cv.ord1$lambda.1se) # nearly all dropped; only int and a small logitsize effect remain
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  1.318501825
## thr          .          
## logitsize    0.008866395
## nvar         .          
## entropy      .          
## max_pp       .          
## KL           .          
## AUCDF        .          
## slope        .          
## int         -0.753174101
## hellinger    .          
## kolmogorov   .          
## median       .          
## N            .

Nearly all of the entropy measures are dropped, with or without OR in the model: with OR, only slope and int survive alongside nvar and OR; without OR, only int and a small logitsize effect remain. Notice the r squared is much higher for the model with OR and N.
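The exact zeros (the ‘.’ entries) in the glmnet output are a consequence of the lasso penalty: each coordinate update soft-thresholds the coefficient, setting it to exactly zero when its partial-residual signal falls below \(\lambda\). A one-function Python sketch of that operator:

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso coordinate update: shrink z toward zero by lam,
    returning exactly zero when |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
```

Weak predictors (here, most of the entropy measures) never escape the threshold and are dropped, while strong ones like OR survive with shrunken coefficients.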


6. Check correlation of entropy measures with OR



7. Non-linear relationship?

Perhaps the relationship with coverage is non-linear? Unfortunately, this does not seem to be the case: fitting restricted cubic splines (rcs) improves the logitsize-only model only slightly, and nowhere near as much as rcs(OR) does.

covered~                   R^2     AIC
logitsize                  0.0026  10248
logitsize+rcs(OR)          0.1424  8714
logitsize+rcs(entropy)     0.0076  10207
logitsize+rcs(max_pp)      0.0074  10208
logitsize+rcs(KL)          0.0076  10207
logitsize+rcs(AUCDF)       0.0074  10208
logitsize+rcs(slope)       0.0058  10225
logitsize+rcs(int)         0.0087  10200
logitsize+rcs(hellinger)   0.0079  10204
logitsize+rcs(kolmogorov)  0.0072  10208
logitsize+rcs(median)      0.0078  10205

8. Random forest


With OR

set.seed(2)
library(randomForest)

new.ord$covered <- as.factor(new.ord$covered)
output.forest <- randomForest(covered ~ ., 
                              data = new.ord, importance=TRUE)
varImpPlot(output.forest) 

print(output.forest) 
## 
## Call:
##  randomForest(formula = covered ~ ., data = new.ord, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 20.18%
## Confusion matrix:
##     0    1 class.error
## 0 441 1660  0.79009995
## 1 358 7541  0.04532219
pred <- predict(output.forest,type="prob")

Without OR

How about if we don’t allow OR or N (fixed anyway) in our model?

set.seed(2)
output.forest1 <- randomForest(covered ~ slope+int+entropy+KL+max_pp+AUCDF+nvar+thr+hellinger+kolmogorov+median, 
                              data = new.ord, importance=TRUE)
varImpPlot(output.forest1) 

print(output.forest1) 
## 
## Call:
##  randomForest(formula = covered ~ slope + int + entropy + KL +      max_pp + AUCDF + nvar + thr + hellinger + kolmogorov + median,      data = new.ord, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 21.73%
## Confusion matrix:
##     0    1 class.error
## 0  43 2058   0.9795336
## 1 115 7784   0.0145588
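The per-class errors printed by randomForest follow directly from the confusion matrix; a small Python check (numpy), using the matrix above:

```python
import numpy as np

def oob_summary(conf):
    """Per-class error and overall OOB error rate from a confusion
    matrix with conf[i, j] = class-i cases predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    class_error = 1.0 - np.diag(conf) / conf.sum(axis=1)
    oob_error = 1.0 - np.trace(conf) / conf.sum()
    return class_error, oob_error

# Confusion matrix from the no-OR forest above
ce, oe = oob_summary([[43, 2058], [115, 7784]])  # oe = 0.2173
```

The class-0 (not covered) error of ~0.98 shows that, without OR, the forest almost always predicts ‘covered’.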

Plot the prediction from the random forest model (with OR) against each entropy measure

It seems that because there is so much noise in the model, none of the entropy measures are good at predicting coverage.


Summary

  • None of the entropy measures seem helpful in predicting coverage. Both linear and non-linear effects have been studied for each measure, but neither brings the AIC (or BIC) anywhere near that of the logitsize+OR model.

  • I have also tried a simulated dataset with everything fixed (as in ‘new’) but this time with OR values up to 1.4; the entropy measures still do not improve the model. [See new1 for OR up to 1.4 code].

———————————–

Examining the effects of entropy for varying OR

1000 posterior probability systems were simulated for 6 different OR values to understand how the OR affects the systems with everything else fixed (N0=N1=1000, LD1=0.3, LD2=0.3, MAF=0.5, nsnps=100).

Summaries for the p-values and posterior probabilities are given below:


The following histograms show the distribution of values of the entropy measures for the simulations.

Explanation: Entropy is low for low power systems, as most posterior probabilities will lie close to the y=1/nsnps line. Entropy increases for higher power systems but perhaps not as much as we’d have liked.

Explanation: The maximum moves from very close to 0 to very close to 1. Again, this seems like an appropriate measure, however it does not distinguish well between low OR systems.

Explanation: The median moves closer to 0 for higher OR systems, as there is one pp holding all the probability, and all the others have to share the small amount that is left. It seems that this measure would struggle for OR values between 1 and 1.25 (most likely OR values).

Explanation: Seems like a good measure to capture entropy, struggles for lower OR values.

Explanation: Seems like a good measure to capture entropy, struggles for lower OR values.

Explanation: Seems like a good measure to capture entropy, struggles for lower OR values.


Summary and questions

  • Why are there some seemingly high power systems in the OR=1 group (see image below for example)? Where is this power coming from (note this isn’t the CV…)?

  • Several of the measures (kolmogorov distance, hellinger distance, AUCDF, max, median) have near identical distributions for OR=1 and OR=1.1, suggesting they would struggle to capture information about OR for low power systems. This seems to be because the distributions of posterior probabilities for OR=1 and OR=1.1 are extremely similar.

  • I think our focus should shift to how we can use this KL measure instead of OR as a predictor of coverage.